4 Main Analysis (Exploratory Data Analysis)
4.1 Summary Measure of Health
We remove the Na’s and take the average over counties’ value for each state. The bar charts visualize the ordering of the amount in all four factors crossing 50 states:
summary_measure_df1 <- summary_measure_df1[complete.cases(summary_measure_df1), ]
summary_measure_state <- summary_measure_df1%>%
group_by(CHSI_State_Abbr) %>%
summarise(meanALE = mean(ALE,rm.na=TRUE), mAD = mean(All_Death),mHS= mean(Health_Status), mUD =mean(Unhealthy_Days))%>%
mutate(meanALE = meanALE, meanAll_Death = mAD, meanHealth_Status = mHS, meanUnhealthy_Days=mUD)
# Average Life Expectancy — This represents the average number of years that a baby born in 1990 is expected to live if current mortality trends continue to apply.
ggplot(summary_measure_state, aes(reorder(CHSI_State_Abbr,meanALE),meanALE))+
geom_bar(stat = "identity",fill='rosybrown')+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.title = element_text(size = 14),
plot.title = element_text(size = 18)
)+
xlab("States") +
ylab("Average Value") +
ggtitle(paste("Histogram Visualization for Average Life Expectancy"))
# All_Death: Mortality from any cause is the average annual rate of all causes of death.
ggplot(summary_measure_state, aes(reorder(CHSI_State_Abbr,meanAll_Death),meanAll_Death))+
geom_bar(stat = "identity",fill='rosybrown')+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.title = element_text(size = 14),
plot.title = element_text(size = 18)
)+
xlab("States") +
ylab("Average Value") +
ggtitle(paste("Histogram Visualization for All Death"))
ggplot(summary_measure_state, aes(reorder(CHSI_State_Abbr,meanHealth_Status),meanHealth_Status))+
geom_bar(stat = "identity",fill='rosybrown')+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.title = element_text(size = 14),
plot.title = element_text(size = 18)
)+
xlab("States") +
ylab("Average Value") +
ggtitle(paste("Histogram Visualization for Self-rated Health Status"))
# The average number of unhealthy days (mental or physical) in the past 30 days, reported by adults age 18 and older is provided,
ggplot(summary_measure_state, aes(reorder(CHSI_State_Abbr,meanUnhealthy_Days),meanUnhealthy_Days))+
geom_bar(stat = "identity",fill='rosybrown')+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.title = element_text(size = 14),
plot.title = element_text(size = 18)
)+
xlab("States") +
ylab("Average Value") +
ggtitle(paste("Histogram Visualization for Unhealthy Days"))
The main conclusion from bar chart plots:
Washington, D.C has the shortest ALE value while Hawaii state has the largest value of ALE. The variance of
ALEis quite small as it ranges from 72 years to 79.47 years.Plots for
Unhealthy DaysandAll Deathhave consistent finding where Hawaii has the smallest value. West Virginia has the largest value of unhealthy days, and Mississippi has the largest value of all deaths.However, the plot for self-rated
Healthy Statusshows interesting results where the states with a larger value ofUnhealthy Daystend to have a higher rating for theirHealth Status.
4.2 Measures of Birth and Death
For the measure of birth, we wanted to explore the correlation between Mother situations and Infant situations. The following scatter plot matrix shows some positive correlations.
library(extracat)
library(dbplyr)
measurebirth_df <- measurebirth %>% dplyr::select(new_LBW, new_VLBW,new_Premature,new_Infant_Mortality,new_IM_Wh_Non_Hisp,new_IM_Bl_Non_Hisp,new_IM_Hisp,new_Under_18, new_Over_40,new_Unmarried,new_Late_Care)
plot(measurebirth_df[,1:11],lower.panel = NULL,cex = 0.3, col ="rosybrown", main="Associations Between Mother Situation And Infant Situations")
library(plotly)
p <-
plot_ly(
data = measurebirth_df,
x = ~ new_Under_18,
y = ~ new_Premature,
marker = list(
size = 10,
color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',
width = 2)
)
) %>%
layout(
title = 'Mother Situation VS. Infant Mortality',
yaxis = list(zeroline = FALSE, title = "Percentage of Premature Birth"),
xaxis = list(zeroline = FALSE, title = "Percentage of Women Get Birth under 18")
)
p
Analysis:
From the Scatter plot matrix, we can clearly see that there is a strong positive relationship between women get to birth at under 18 years old and four unhealthy infant diseases including low birth weight
new_LBW, very low birth weightnew_VLBW, premature birthnew_Prematureand infant mortalityInfant_Mortality. The detail relationships between each one can be explored in the interactive components as follows.We can clearly see that the states have the higher percentage of women who are under 18 get birth to tend to have the higher percentage of premature birth, which indicates that young mothers (under 18) have negative impacts on the health of infants.
4.3 Measure of Preventive Diseases
library(tidyr)
preventive <-gather(preventive_df, key = variable, value = Value,
sumFluB_Rpt, sumHepB_Rpt, sumMeas_Rpt,sumPert_Rpt,sumCRS_Rpt,sumSyphilis_Rpt)
ggplot(preventive)+geom_bar(aes(x=reorder(variable, Value), y=Value), fill="rosybrown",stat ="identity" )+ggtitle("Bar plot of All Preventive Diseases (Reported cases)")+labs(x="Preventive Diseases", y= "Number of Cases")
Analysis:
- The Syphilis and Pertussis are the two diseases that have higher number of reported cases among other diseases.
- The number of Congenital Rubella Syndrome reported cases is really small (close to 0) across the US.
4.4 Risk Factors
We studied the geographical patterns of risk factors and the relationship between them.
For the obesity, which is our variable of primal risk factor here, we plotted its grayscale map to the county level and major cities has been marked out in the plot as red crosses.
library(maps)
risk_factor = risk_factor[ which(risk_factor$Obesity > 0),]
toFIPS = function(state, county) {
state = sprintf("%02d", state)
county = sprintf("%03d", county)
return(as.numeric(paste0(state,county)))
}
toZIP = function(state, county, ct) {
if (length(which(ct$STATE == state && ct$COUNTY == county)) == 0) {
return("-1")
}
return(ct[which(ct$STATE == state && ct$COUNTY == county), 'ZCTA5'])
}
plot_df = data.frame(region = vector(length = nrow(risk_factor)), value = vector(length = nrow(risk_factor)))
for (i in 1:nrow(risk_factor)) {
plot_df[i, "region"] = toFIPS(risk_factor[i, "State_FIPS_Code"], risk_factor[i, "County_FIPS_Code"])
plot_df[i, "value"] = gray(abs(risk_factor[i, "Obesity"] / max(risk_factor[,"Obesity"])))
}
maps::map("county", fill=TRUE, col=plot_df$value)
maps::map.cities(x = us.cities, country = "", label = NULL, minpop = 0,
maxpop = Inf, capitals = 2, cex = 2, projection = FALSE,
parameters = NULL, orientation = NULL, pch = 3,col="red")
It can be observed that
- the Southern states tend to be more obsessed than other parts of the US.
- people in the major cities seem to be more obsessed
We then study for the relationship between obesity and other factors.
risk$diabete = diabete$x
risk$few_fruit = few_fruit$x
risk$High_Blood_Pres = High_Blood_Pres$x
theme_dotplot <- theme_bw(18) +
theme(axis.text.y = element_text(size = rel(.75)),
axis.ticks.y = element_blank(),
axis.title.x = element_text(size = rel(.75)),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(size = 0.5),
panel.grid.minor.x = element_blank())
ggplot() + geom_point(data=risk,
aes(x = x,
y = fct_reorder(Abbr, x), color = "green")) +
geom_point(data=risk,
aes(x = no_ex,
y = fct_reorder(Abbr, no_ex), color = "red")) +
geom_point(data=risk,
aes(x = few_fruit,
y = fct_reorder(Abbr, few_fruit), color = "blue")) +
geom_point(data=risk,
aes(x = diabete,
y = fct_reorder(Abbr, diabete), color = "orange")) +
geom_point(data=risk,
aes(x = High_Blood_Pres,
y = fct_reorder(Abbr, High_Blood_Pres), color = "purple")) +
scale_colour_manual(name = 'Variables',
values =c("green"="green","red"="red", "blue" = "blue", "orange" = "orange", "purple" = "purple"),
labels = c("green"='Obesity Index', "red"='No-excercise Index', "blue" ='Few Fruit Index',"orange" = 'Diabete Index',"purple" = 'High Blood Pressure Index'),
breaks=c("green", "red","blue", "orange", "purple")) +
ylab("") + xlab("Index") + theme_dotplot
It can be seen that there exists a strong correlation among these health risk factors, and the Southern states ranked higher on this Cleveland plot.
4.5 Leading Causes of Deaths
We used pre-cleaned data processed by python. The cleaning script is located in out github repository. We first take a general view on the total number of deaths due to each cause. Since we take the sums of data over all the counties in the US, even though the number of deaths might be smaller than the actual one for causes with the smaller number of deaths, this should have little impact on their absolute ranking.
death_mosaic %>%
group_by(disease) %>%
summarise(deaths = sum(deaths)) %>%
ggplot() + geom_col(aes(x = fct_reorder(disease, deaths, .desc = TRUE), y = deaths), fill = "rosybrown") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14)
) +
ggtitle("Number of Deaths Caused by Each Factor") + xlab("Death Cause") + ylab("Number of Deaths")
Heart disease as the first killer is not very surprising. However, this should still be a warning that people have to take actions to prevent heart disease. An unexpected fact is that suicide ranks the 4th among all the leading factors of deaths. This is a signal of bad mental health.
We then explore the geographical distribution of the number of deaths due to different causes. In this case, note that we may underestimate the death numbers in areas with lower population. Therefore, we mainly focus on those areas with significant high death rates due to a certain cause. While analyzing the deaths of causes in each state, we use the death rates of a cause per 100000 people in states to indicate the color in the graph to account for the population differences among states. After some exploration of the causes, we find three remarkable ones. The related graphs are in the Executive Summary section.
Browsing through all causes of deaths, we can see that there are two clusters in the US that have higher death rates among almost all death causes: one is around Mississippi and Alabama, while the other is around Wyoming and South Dakota. Besides, Alaska has the highest suicide rate. The reasons leading to this remain unclear for further investigation.